Previous work has demonstrated that an intestinal helminth parasite (Heligmosomoides polygyrus) is able to expand host regulatory T-cell populations through activation of TGF-𝛽 receptors(Johnston et al. 2017). Further work went on to identify a protein, H. polygyrus TGF-β mimic (Hp-TGM), which potently binds host TGF-𝛽 receptors; however, this protein is reported to display little to no sequence homology to human TGF-𝛽.
Objectives
In this notebook, we will explore the similarities and differences between human TGF-𝛽 (hTGF-𝛽) and the H. polygyrus TGF-β mimic (Hp-TGM) using a multi-layered in-silico strategy. First, we will compare amino acid sequences between the two proteins searching for local, global, and motif homology. Second, we will employ alphafold multimer for complex structure prediction, and explore the structures of hTGF-𝛽 + hTGF-𝛽 receptor(s) in comparison to HP-TGM + hTGF-𝛽 receptor(s). We will then further improve and our complex predictions using EvoDock (an evolutionary algorithm application of RosettaDock) and determine binding residues and complex binding energies. These analyses will enable quantitative comparisons between sequence, structure, and docking based computational strategies and ultimately guide the development of a high-throughput in-silico screen for human immune system host-microbe-interactions (HMI).
gget_seq <-function(ensembl_id, amino_acid =TRUE) { seq_results <- gget$seq(ensembl_id, translate = amino_acid)return(seq_results[2])}tgf_keys <-c("TGFB1", "TGFB2", "TGFB3", "TGFBR1", "TGFBR2", "TGFBR3")tgfb_search <- gget$search(tgf_keys, "homo_sapiens")tgfb_search %<>%filter(biotype =="protein_coding", gene_name %in% tgf_keys)# collect amino acid sequences for each protein using the ensembl_idtgfb_search %<>%mutate(sequences_aa =map_chr(ensembl_id, gget_seq))# Manually editing sequences - removing unwanted subdomains within pro-proteinsHP_TMG_SEQ <-"DDSGCMPFSDEAATYKYVAKGPKNIEIPAQIDNSGMYPDYTHVKRFCKGLHGEDTTGWFVGICLASQWYYYEGVQECDDRRCSPLPTNDTVSFEYLKATVNPGIIFNITVHPDASGKYPELTYIKRICKNFPTDSNVQGHIIGMCYNAEWQFSSTPTCPASGCPPLPDDGIVFYEYYGYAGDRHTVGPVVTKDSSGNYPSPTHARRRCRALSQEADPGEFVAICYKSGTTGESHWEYYKNIGKCPDPRCKPLEANESVHYEYFTMTNETDKKKGPPAKVGKSGKYPEHTCVKKVCSKWPYTCSTGGPIFGECIGATWNFTALMECINARGCSSDDLFDKLGFEKVIVRKGEGSDSYKDDFARFYATGSKVIAECGGKTVRLECSNGEWHEPGTKTVHRCTKDGIRTL"TGFB1 <-"ALDTNYCFSSTEKNCCVRQLYIDFRKDLGWKWIHEPKGYHANFCLGPCPYIWSLDTQYSKVLALYNQHNPGASAAPCCVPQALEPLPIVYYVGRKPKVEQLSNMIVRSCKCS"TGFB2 <-"ALDAAYCFRNVQDNCCLRPLYIDFKRDLGWKWIHEPKGYNANFCAGACPYLWSSDTQHSRVLSLYNTINPEASASPCCVSQDLEPLTILYYIGKTPKIEQLSNMIVKSCKCS"TGFB3 <-"ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHSTVLGLYNTLNPEASASPCCVPQDLEPLTILYYVGRTPKVEQLSNMVVKSCKCS"hp_mimic_data <-tribble(~gene_name, ~sequences_aa,"HP-TGM", HP_TMG_SEQ )tgfb_search %<>%bind_rows(hp_mimic_data)saveRDS(tgfb_search, glue("{wkdir}/data/interim/tmp/","{Sys.Date()}_TGFB_gget.rds"))
Saving individual sequences in FASTA format.
Code
# Save individual fasta files for each protein -----tgfb_search <-readRDS(glue("{wkdir}/data/interim/tmp/","2023-02-27_TGFB_gget.rds"))# Manually editing the fasta sequences to remove the signal peptide and other non-active domainstgfb_search %<>%mutate(sequences_aa =case_when( gene_name =="TGFB1"~ TGFB1, gene_name =="TGFB2"~ TGFB2, gene_name =="TGFB3"~ TGFB3,TRUE~ sequences_aa ))fastas_raw_dir <-glue("{wkdir}/data/interim/fastas/raw/monomers")for (tf in tgfb_search$gene_name) { tf_aa <- tgfb_search %>%filter(gene_name == tf) %>%pull(sequences_aa) fasta_path <-glue("{fastas_raw_dir}/{tf}.fasta")write.fasta(sequences = tf_aa,names = tf,file.out = fasta_path,open ="w",nbchar =60,as.string =FALSE )}
Applying SignalP 6.0 to predict and remove signal peptides from fasta files.
Code
fastas_raw_paths <-paste0( fastas_raw_dir, "/", list.files(fastas_raw_dir, pattern ="fasta"))names(fastas_raw_paths) <- fastas_raw_paths %>% purrr::map(~fs::path_ext_remove(basename(.)))fastas_processed_tgfb <-list()for (id innames(fastas_raw_paths)) { fasta_path <- fastas_raw_paths[[id]] fastas_processed_tgfb[[id]] <-glue("{wkdir}/data/interim/fastas/processed/monomers/{id}" )if (dir.exists(fastas_processed_tgfb[[id]])) {next } message("Processing ", id, " fasta file...")slurm_shell_do(glue("conda run -n signalp6"," signalp6"," --organism eukarya"," --fastafile {fasta_path}"," --output_dir {fastas_processed_tgfb[[id]]}"," --write_procs 8"," --format all"," --mode fast" ),ncpus =8 )}# IF no signal peptide is found, the fasta file is copied to the processed dirfor (id innames(fastas_raw_paths)) { raw_fasta <- fastas_raw_paths[[id]] processed_fasta <-glue("{fastas_processed_tgfb[[id]]}/","processed_entries.fasta" )if (file.info(processed_fasta)$size ==0){message("Overwriting ", id, " processed fasta file...")shell_do(glue("cp {raw_fasta} {processed_fasta}")) }}
Figure 1: Local and global pairwise sequence alignment.
This figure depicts pairwise sequence alignment scores between all human TGF-β isomers I-III and the H. polygyrus TGF-β mimic (Hp-TGM). Color of the heatmap represents the alignment score and the top and bottom values within each cell indicate the percent identity and similarity of a sequence pair, respectively. These results indicate that human TGF-β isomers share a moderate degree of sequence similarity within isomer comparisons, but very little similarity with HP-TGM. This is indicated by a decrease of approximately two-fold in the percentage similarity observed in inter-species comparisons.
Hidden Markov Model Similarity
Creating MSA profiles for each protein using jackhmmer against UniProt.
Figure 2: HP-TGM profile Hidden Markov Model hits (E value <= 0.05) against our custom microbial protein catalog. The x and y axes display E-value signifiance for the overall sequence and the best domain, respectively. Points are colored by the subcatalog of origin.
Code
ggplotly(hmmer_hit_plots$TGFB1)
Figure 3: TGF-β1 profile Hidden Markov Model hits (E value <= 0.05) against our custom microbial protein catalog. The x and y axes display E-value signifiance for the overall sequence and the best domain, respectively. Points are colored by the subcatalog of origin.
Code
ggplotly(hmmer_hit_plots$TGFB2)
Figure 4: TGF-β2 profile Hidden Markov Model hits (E value <= 0.05) against our custom microbial protein catalog. The x and y axes display E-value signifiance for the overall sequence and the best domain, respectively. Points are colored by the subcatalog of origin.
Code
ggplotly(hmmer_hit_plots$TGFB3)
Figure 5: TGF-β3 profile Hidden Markov Model hits (E value <= 0.05) against our custom microbial protein catalog. The x and y axes display E-value signifiance for the overall sequence and the best domain, respectively. Points are colored by the subcatalog of origin.
Key Conclusions:
HP-TGM shares little to no sequence similarity with human human TGF-β isomers as indicated by local and global alignment.
When analyzing sequence motifs, HP-TGM displays similarity to Complement Control Protein (CCP, or Sushi) family domains, suggesting an entirely alternative origin than human TGF-β.
Signal Peptide Removal SignalP 6.0 (fast) was used to predict and trim signal peptides. This prediction prediction model is based on a Bert protein language model encoder and a conditional random field (CRF) decoder.
Sequence Alignment Needleman-Wunsch and Smith-Waterman algorithms were applied using emboss v6.6.0.
profile Hidden Markov Model Analysis profile Hidden Markov Models were created using jackhmmer from HMMER v.3.3, with a maximum of 5 iterations and an E-value threshold of 0.05 against the latest UniProt catalog (as of 2023-03-02). Hidden models were assembled via hmmbuild and hmmsearch was applied against our custom microbial catalog database.
Protein Complex Structure Prediction Alphafold v2.2.0 was applied using the “multimer” setting.
Johnston, Chris J. C., Danielle J. Smyth, Ravindra B. Kodali, Madeleine P. J. White, Yvonne Harcus, Kara J. Filbey, James P. Hewitson, et al. 2017. “A Structurally Distinct TGF-β Mimic from an Intestinal Helminth Parasite Potently Induces Regulatory t Cells.”Nature Communications 8 (November): 1741. https://doi.org/10.1038/s41467-017-01886-6.
Source Code
---title: "An *in-silico* exploration of an evolutionarily convergent TGF-𝛽 mimetic from a parasitic nematode *Heligmosomoides polygyrus bakeri*"bibliography: references.bibeditor: visualauthor: "Joe Boktor"date: '2023-02-10'format: html: font-family: helvetica neueu page-layout: full toc: true toc-location: left toc-depth: 3 self-contained: true code-fold: true code-tools: true fig-align: center---## BackgroundPrevious work has demonstrated that an intestinal helminth parasite (*Heligmosomoides polygyrus)* is able to expand host regulatory T-cell populations through activation of TGF-𝛽 receptors[@johnston2017]. Further work went on to identify a protein, *H. polygyrus* TGF-β mimic (*Hp-*TGM), which potently binds host TGF-𝛽 receptors; however, this protein is reported to display little to no sequence homology to human TGF-𝛽. ## ObjectivesIn this notebook, we will explore the similarities and differences between human TGF-𝛽 (hTGF-𝛽) and the *H. polygyrus* TGF-β mimic (*Hp-*TGM) using a multi-layered *in-silico* strategy. First, we will compare amino acid sequences between the two proteins searching for local, global, and motif homology. Second, we will employ alphafold multimer for complex structure prediction, and explore the structures of hTGF-𝛽 + hTGF-𝛽 receptor(s) in comparison to HP-TGM + hTGF-𝛽 receptor(s). We will then further improve and our complex predictions using EvoDock (an evolutionary algorithm application of RosettaDock) and determine binding residues and complex binding energies. These analyses will enable quantitative comparisons between sequence, structure, and docking based computational strategies and ultimately guide the development of a high-throughput *in-silico* screen for human immune system host-microbe-interactions (HMI).## AnalysisSetup```{r}#| warning: falselibrary(tidyverse)library(reticulate)library(magrittr)library(glue)library(seqinr)library(future)library(batchtools)library(future.batchtools)library(fs)library(tictoc)library(listenv)library(progress)library(viridis)library(bio3d)library(r3dmol)library(strex)library(plotly)library(ggsci)tmpdir <-"/central/scratch/jbok/tmp"homedir <-"/central/groups/MazmanianLab/joeB"wkdir <-glue("{homedir}/","Microbiota-Immunomodulation/Microbiota-Immunomodulation")src_dir <-glue("{wkdir}/notebooks")source(glue("{src_dir}/R-scripts/helpers_general.R"))source(glue("{src_dir}/R-scripts/helpers_sequence-screens.R"))source(glue("{src_dir}/R-scripts/rhmmer-package.R"))protein_catalogs <-glue("{homedir}/", "Downloads/protein_catalogs")shell_do(glue("mkdir -p {wkdir}/data/interim/fastas/raw/monomers"))shell_do(glue("mkdir -p {wkdir}/data/interim/fastas/processed/monomers"))shell_do(glue("mkdir -p {wkdir}/data/interim/fastas/processed/complexes"))``````{r, eval = FALSE}gget_seq <-function(ensembl_id, amino_acid =TRUE) { seq_results <- gget$seq(ensembl_id, translate = amino_acid)return(seq_results[2])}tgf_keys <-c("TGFB1", "TGFB2", "TGFB3", "TGFBR1", "TGFBR2", "TGFBR3")tgfb_search <- gget$search(tgf_keys, "homo_sapiens")tgfb_search %<>%filter(biotype =="protein_coding", gene_name %in% tgf_keys)# collect amino acid sequences for each protein using the ensembl_idtgfb_search %<>%mutate(sequences_aa =map_chr(ensembl_id, gget_seq))# Manually editing sequences - removing unwanted subdomains within pro-proteinsHP_TMG_SEQ <-"DDSGCMPFSDEAATYKYVAKGPKNIEIPAQIDNSGMYPDYTHVKRFCKGLHGEDTTGWFVGICLASQWYYYEGVQECDDRRCSPLPTNDTVSFEYLKATVNPGIIFNITVHPDASGKYPELTYIKRICKNFPTDSNVQGHIIGMCYNAEWQFSSTPTCPASGCPPLPDDGIVFYEYYGYAGDRHTVGPVVTKDSSGNYPSPTHARRRCRALSQEADPGEFVAICYKSGTTGESHWEYYKNIGKCPDPRCKPLEANESVHYEYFTMTNETDKKKGPPAKVGKSGKYPEHTCVKKVCSKWPYTCSTGGPIFGECIGATWNFTALMECINARGCSSDDLFDKLGFEKVIVRKGEGSDSYKDDFARFYATGSKVIAECGGKTVRLECSNGEWHEPGTKTVHRCTKDGIRTL"TGFB1 <-"ALDTNYCFSSTEKNCCVRQLYIDFRKDLGWKWIHEPKGYHANFCLGPCPYIWSLDTQYSKVLALYNQHNPGASAAPCCVPQALEPLPIVYYVGRKPKVEQLSNMIVRSCKCS"TGFB2 <-"ALDAAYCFRNVQDNCCLRPLYIDFKRDLGWKWIHEPKGYNANFCAGACPYLWSSDTQHSRVLSLYNTINPEASASPCCVSQDLEPLTILYYIGKTPKIEQLSNMIVKSCKCS"TGFB3 <-"ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHSTVLGLYNTLNPEASASPCCVPQDLEPLTILYYVGRTPKVEQLSNMVVKSCKCS"hp_mimic_data <-tribble(~gene_name, ~sequences_aa,"HP-TGM", HP_TMG_SEQ )tgfb_search %<>%bind_rows(hp_mimic_data)saveRDS(tgfb_search, glue("{wkdir}/data/interim/tmp/","{Sys.Date()}_TGFB_gget.rds"))```Saving individual sequences in FASTA format.```{r, eval=FALSE}# Save individual fasta files for each protein -----tgfb_search <-readRDS(glue("{wkdir}/data/interim/tmp/","2023-02-27_TGFB_gget.rds"))# Manually editing the fasta sequences to remove the signal peptide and other non-active domainstgfb_search %<>%mutate(sequences_aa =case_when( gene_name =="TGFB1"~ TGFB1, gene_name =="TGFB2"~ TGFB2, gene_name =="TGFB3"~ TGFB3,TRUE~ sequences_aa ))fastas_raw_dir <-glue("{wkdir}/data/interim/fastas/raw/monomers")for (tf in tgfb_search$gene_name) { tf_aa <- tgfb_search %>%filter(gene_name == tf) %>%pull(sequences_aa) fasta_path <-glue("{fastas_raw_dir}/{tf}.fasta")write.fasta(sequences = tf_aa,names = tf,file.out = fasta_path,open ="w",nbchar =60,as.string =FALSE )}```Applying [SignalP 6.0](https://github.com/fteufel/signalp-6.0) to predict and remove signal peptides from fasta files.```{r, eval=FALSE}fastas_raw_paths <-paste0( fastas_raw_dir, "/", list.files(fastas_raw_dir, pattern ="fasta"))names(fastas_raw_paths) <- fastas_raw_paths %>% purrr::map(~fs::path_ext_remove(basename(.)))fastas_processed_tgfb <-list()for (id innames(fastas_raw_paths)) { fasta_path <- fastas_raw_paths[[id]] fastas_processed_tgfb[[id]] <-glue("{wkdir}/data/interim/fastas/processed/monomers/{id}" )if (dir.exists(fastas_processed_tgfb[[id]])) {next } message("Processing ", id, " fasta file...")slurm_shell_do(glue("conda run -n signalp6"," signalp6"," --organism eukarya"," --fastafile {fasta_path}"," --output_dir {fastas_processed_tgfb[[id]]}"," --write_procs 8"," --format all"," --mode fast" ),ncpus =8 )}# IF no signal peptide is found, the fasta file is copied to the processed dirfor (id innames(fastas_raw_paths)) { raw_fasta <- fastas_raw_paths[[id]] processed_fasta <-glue("{fastas_processed_tgfb[[id]]}/","processed_entries.fasta" )if (file.info(processed_fasta)$size ==0){message("Overwriting ", id, " processed fasta file...")shell_do(glue("cp {raw_fasta} {processed_fasta}")) }}```Saving FASTA files for complex predictions```{r, eval=FALSE}# Saving protein complexes (post-processing) ----tgfb_search <-readRDS(glue("{wkdir}/data/interim/tmp/","2023-02-27_TGFB_gget.rds"))# Split into TFs and Receptors and combine all possible pairstgf_receptors <- tgfb_search %>%filter(grepl("receptor", ensembl_description)) %>%pull(gene_name)tgf_tf <- tgfb_search %>%filter(!grepl("receptor", ensembl_description)) %>%pull(gene_name)format_seq <-function(fasta) { fasta %>% seqinr::getSequence() %>%unlist() %>%paste(collapse ="")}# Saving paired fasta filesfastas_merged_path <-glue("{wkdir}/data/interim/fastas/processed/complexes")for (tf in tgf_tf) { tf_aa <- seqinr::read.fasta(glue("{fastas_processed_tgfb[[tf]]}/processed_entries.fasta"),seqtype ="AA" )for (receptor in tgf_receptors) { rec_aa <- seqinr::read.fasta(glue("{fastas_processed_tgfb[[receptor]]}/processed_entries.fasta"),seqtype ="AA" ) merged_name <-paste( seqinr::getName(tf_aa), seqinr::getName(rec_aa), sep ="_" ) merged_fasta_path <-glue("{fastas_merged_path}/{merged_name}.fasta")if (file.exists(merged_fasta_path)) {next }message("Saving: ", merged_name)write.fasta(sequences =list(format_seq(tf_aa), format_seq(rec_aa)),names =list(seqinr::getName(tf_aa), seqinr::getName(rec_aa)),file.out = merged_fasta_path,open ="w",nbchar =60,as.string =FALSE ) }}# Saving trioed fasta filesrec1 <- seqinr::read.fasta(glue("{fastas_processed_tgfb[['TGFBR1']]}/processed_entries.fasta"),seqtype ="AA" )rec2 <- seqinr::read.fasta(glue("{fastas_processed_tgfb[['TGFBR2']]}/processed_entries.fasta"),seqtype ="AA" )for (tf in tgf_tf) { tf_aa <- seqinr::read.fasta(glue("{fastas_processed_tgfb[[tf]]}/processed_entries.fasta"),seqtype ="AA" ) merged_name <-paste(seqinr::getName(tf_aa), "TGFBR1_TGFBR2", sep ="_") merged_fasta_path <-glue("{fastas_merged_path}/{merged_name}.fasta")if (file.exists(merged_fasta_path)) {next }message("Saving: ", merged_name)write.fasta(sequences =list(format_seq(tf_aa), format_seq(rec1),format_seq(rec2) ),names =list( seqinr::getName(tf_aa), seqinr::getName(rec1), seqinr::getName(rec2) ),file.out = merged_fasta_path,open ="w",nbchar =60,as.string =FALSE )}```### Sequence Similarity (Local & Global)Creating a dataframe of job metadata for slurm batchtools.```{r, eval=FALSE}sequence_analysis_dir <-glue("{wkdir}/data/interim/","tgfb_analyses/sequence-alignment")tgfb_ligand_dir <-glue("{wkdir}/data/interim/alphafold-multimer/TGFB_ligands")shell_do(glue("mkdir -p {sequence_analysis_dir}"))shell_do(glue("mkdir -p {tgfb_ligand_dir}"))seqinr::write.fasta(sequences =as.list(tgf_tf$sequences_aa),names =as.list(tgf_tf$gene_name),file.out =glue("{tgfb_ligand_dir}/tgfb_ligands.fasta"),open ="w",nbchar =60,as.string =FALSE)alignment_df_list <-bind_rows( tgf_tf %>%mutate(ensembl_id = gene_name) %>% dplyr::select(ensembl_id, sequences_aa) %>%mutate(method ="needle"), tgf_tf %>%mutate(ensembl_id = gene_name) %>% dplyr::select(ensembl_id, sequences_aa) %>%mutate(method ="water")) %>%mutate(refdb_path =glue("{tgfb_ligand_dir}/tgfb_ligands.fasta"),output_dir = sequence_analysis_dir )alignment_df_list %>%glimpse()```Executing slurm sequence alignment jobs.```{r, eval=FALSE}# configure registry ----cluster_run <-glue("{Sys.Date()}_TGFB-EMBOSS-Alignment_ID-{rand_string()}/")message("\n\nRUNNING: ", cluster_run, "\n")breg <-makeRegistry(file.dir =glue("{wkdir}/.cluster_runs/", cluster_run ),seed =42)breg$cluster.functions <- batchtools::makeClusterFunctionsSlurm(template =glue("{wkdir}/batchtools_templates/batchtools.slurm.tmpl"),scheduler.latency =0.5,fs.latency =65)# Submit Jobs ----jobs <-batchMap(fun = align_fasta_sequences,args = alignment_df_list,reg = breg)setJobNames(jobs,paste("TGFB-EMBOSS", alignment_df_list$ensembl_id, alignment_df_list$method,sep ="_"),reg = breg)getJobNames(jobs, reg = breg)submitJobs(jobs,resources =list(walltime =3600,memory =1024,ncpus =1,max.concurrent.jobs =9999 ))```Aggregating alignment files.```{r, eval=FALSE}file_paths <-list.files(sequence_analysis_dir, pattern ="rds$", full.names =TRUE)needle_files <- file_paths %>%keep(grepl("HP-TGM_tgfb_ligands.needle.rds", .))water_files <- file_paths %>%keep(grepl("HP-TGM_tgfb_ligands.water.rds", .))tgfb_alignments <-bind_rows( file_paths %>%keep(grepl("needle", .)) %>%load_rds_files() %>%mutate(method ="Needleman-Wunsch (global) "), file_paths %>%keep(grepl("water", .)) %>%load_rds_files() %>%mutate(method ="Smith-Waterman (local)") )saveRDS(tgfb_alignments,glue("{wkdir}/data/interim/tgfb_analyses/sequence-alignment-df.rds") )``````{r}tgfb_alignments <-readRDS(glue("{wkdir}/data/interim/tgfb_analyses/sequence-alignment-df.rds") )# Plotting sequence alignment dataalignment_heatmap <- tgfb_alignments %>%ggplot(aes(x = protein_1, y = protein_2, fill = Score)) +geom_tile() +geom_text(aes(label =paste0(percent_identity, "%\n", percent_similarity, "%")), color ="white", size =3) +theme_bw() +facet_wrap( ~ method) +scale_x_discrete(expand=c(0,0)) +scale_y_discrete(expand=c(0,0)) +scale_fill_viridis_c(option ="G") +labs(x ="", y ="", fill ="Alignment\nScore") +theme(axis.text.x =element_text(angle=45, hjust=1) )ggsave(alignment_heatmap, filename =glue("{wkdir}/figures/tgfb-analysis/sequence-alignment-heatmap.png"),dpi =600, width =7, height =3)```{#fig-emboss}This figure depicts pairwise sequence alignment scores between all human TGF-β isomers I-III and the H. polygyrus TGF-β mimic (Hp-TGM). Color of the heatmap represents the alignment score and the top and bottom values within each cell indicate the percent identity and similarity of a sequence pair, respectively. These results indicate that human TGF-β isomers share a moderate degree of sequence similarity within isomer comparisons, but very little similarity with HP-TGM. This is indicated by a decrease of approximately two-fold in the percentage similarity observed in inter-species comparisons.### Hidden Markov Model Similarity Creating MSA profiles for each protein using jackhmmer against UniProt.```{r, eval=FALSE}tgfb_fasta_dir <-glue("{wkdir}/data/interim/alphafold-multimer/TGFB_fasta")tgfb_fasta_paths <-paste0( tgfb_fasta_dir, "/", list.files(tgfb_fasta_dir, pattern ="fasta") %>%keep(!grepl("_", .)))for (fasta_path in tgfb_fasta_paths) { id <- fs::path_ext_remove(basename(fasta_path)) output_dir <-glue("{wkdir}/data/processed/","jackhmmer_results/{id}" )shell_do(glue("mkdir -p {output_dir}"))if (!file.exists(glue("{output_dir}/{id}_msa.fasta"))) {slurm_shell_do(cmd =glue("jackhmmer "," --cpu 8"," -N 5"," -E 0.05"," -A {output_dir}/{id}_msa.fasta"," -o {output_dir}/{id}_jackhmmer.out"," --tblout {output_dir}/{id}_seqhits.txt"," --domtblout {output_dir}/{id}_domainhits.txt"," --noali"," {fasta_path}"," /central/software/alphafold/data/uniprot/uniprot.fasta" ),jobname =glue("jkhmr_{fasta_path}_{rand_string()}"),working_dir = wkdir,ncpus =8,memory ="4G",walltime =36000 ) }}```Creating HMM profiles from MSAs and estimating sequence similiarity using Hmmsearch on custom protein catalog.```{r, eval=FALSE}protein_catalog_path <-glue("{homedir}/Downloads/protein_catalogs/clustered_catalogs","/merged/2023-02-13_catalog_MMSeq2-95_rep_seq.fasta")ids <- tgfb_fasta_paths %>% purrr::map(~ fs::path_ext_remove(basename(.)))for (id in ids){ input_dir <-glue("{wkdir}/data/processed/","jackhmmer_results/{id}" ) output_dir <-glue("{wkdir}/data/processed/","hmmersearch_results/{id}" )shell_do(glue("mkdir -p {output_dir}"))# make hmm profile from MSA fasta_path <-glue("{input_dir}/{id}_msa.fasta")shell_do(glue("hmmbuild {output_dir}/{id}.hmm {fasta_path}"))# hmmsearch on protein catalog using hmm profileslurm_shell_do(cmd =glue("hmmsearch "," --cpu 8"," -A {output_dir}/{id}_msa.hmm"," -o {output_dir}/{id}_hmmsearch.out"," --tblout {output_dir}/{id}_tblout.txt"," --domtblout {output_dir}/{id}_domtblout.txt"," --noali"," {output_dir}/{id}.hmm"," {protein_catalog_path}" ),jobname =glue("hmrsearch_{fasta_path}_{rand_string()}"),working_dir = wkdir,ncpus =8,memory ="4G",walltime =36000 )}```Visualizing pHMM results.```{r}catalogs <-c("ELGG","Hadza","KIJ","MGV","RUGS","RUMMETA","UHGP","VEuPathDB","Wormbase")catalogs_cols <-pal_npg()(length(catalogs))names(catalogs_cols) <- catalogshmmer_results_dir <-glue("{wkdir}/data/interim/tgfb_analyses/hmmersearch_results/")hmmer_results_paths <-list.dirs(glue("{wkdir}/data/interim/tgfb_analyses/hmmersearch_results/"), recursive =FALSE)hmmer_hit_plots <-list()for (id inbasename(hmmer_results_paths)) { sequence_hits <-read_tblout(glue("{hmmer_results_dir}/{id}/{id}_tblout.txt") ) hmmer_hit_plots[[id]] <- sequence_hits %>%mutate(catalog =map_chr(str_before_nth(domain_name, "_", 2), ~str_remove(., "CATID_")) ) %>%ggplot(aes(x =-log10(sequence_evalue), y =-log10(best_domain_evalue),group = description )) +scale_color_manual(values = catalogs_cols) +geom_point(aes(color = catalog)) +theme_bw()}```::: panel-tabset#### HP-TGM```{r}#| label: fig-HMMER1#| fig-cap: "HP-TGM profile Hidden Markov Model hits (E value <= 0.05) against our custom microbial protein catalog. The x and y axes display E-value signifiance for the overall sequence and the best domain, respectively. Points are colored by the subcatalog of origin."ggplotly(hmmer_hit_plots$`HP-TGM` )```#### TGF-β1```{r}#| label: fig-HMMER2#| fig-cap: "TGF-β1 profile Hidden Markov Model hits (E value <= 0.05) against our custom microbial protein catalog. The x and y axes display E-value signifiance for the overall sequence and the best domain, respectively. Points are colored by the subcatalog of origin."ggplotly(hmmer_hit_plots$TGFB1)```#### TGF-β2```{r}#| label: fig-HMMER3#| fig-cap: "TGF-β2 profile Hidden Markov Model hits (E value <= 0.05) against our custom microbial protein catalog. The x and y axes display E-value signifiance for the overall sequence and the best domain, respectively. Points are colored by the subcatalog of origin."ggplotly(hmmer_hit_plots$TGFB2)```#### TGF-β3```{r}#| label: fig-HMMER4#| fig-cap: "TGF-β3 profile Hidden Markov Model hits (E value <= 0.05) against our custom microbial protein catalog. The x and y axes display E-value signifiance for the overall sequence and the best domain, respectively. Points are colored by the subcatalog of origin."ggplotly(hmmer_hit_plots$TGFB3)```:::### **Key Conclusions:**- HP-TGM shares little to no sequence similarity with human human TGF-β isomers as indicated by local and global alignment.- When analyzing sequence motifs, HP-TGM displays similarity to Complement Control Protein (CCP, or Sushi) family domains, suggesting an entirely alternative origin than human TGF-β.### Structural Analysis and Complex PredictionsExecuting alpha-fold multimer models```{r, eval=FALSE}cluster_reports <-glue("{wkdir}/.cluster_runs/","{Sys.Date()}_AlphaFold-Multimer_TGFB-trimmedseqs_{rand_string()}")message("\n\n CREATING: ", cluster_reports, "\n")shell_do(glue("mkdir -p {cluster_reports}"))complex_fastas <-list.files(glue("{wkdir}/data/interim/fastas/processed/complexes"),# glue("{wkdir}/data/interim/alphafold-multimer/TGFB_fasta"),full.names =TRUE )for (complex_path in complex_fastas) { jname <- fs::path_ext_remove(basename(complex_path))slurm_run_alphafold(jobname = jname,slurm_out = cluster_reports,input_fasta = complex_path,mem ="128G",gpus =1,cpus_per_task =8,output_dir =glue("{wkdir}/data/interim/alphafold-multimer/2023-02-28_TGFB_complexes/{jname}") )}``````{r, eval=FALSE}receptor_ligand_pairs <-c("TGFB1_TGFBR1","TGFB2_TGFBR2","TGFB3_TGFBR1_TGFBR2","TGFB2_TGFBR1","TGFB2_TGFBR2","TGFB2_TGFBR1_TGFBR2","TGFB3_TGFBR1","TGFB3_TGFBR2","TGFB3_TGFBR1_TGFBR2","HP-TGM_TGFBR1","HP-TGM_TGFBR2","HP-TGM_TGFBR1_TGFBR2")pdb_models <-list()for (pair in receptor_ligand_pairs) { pdb_data <- bio3d::read.pdb(glue("{wkdir}/data/interim/alphafold-multimer/","2023-02-28_TGFB_complexes/{pair}/{pair}","/ranked_0.pdb" ),multi =TRUE ) pdb_models[[pair]] <-r3dmol(viewer_spec =m_viewer_spec(cartoonQuality =2,lowerZoomLimit =50,upperZoomLimit =2000 ) ) %>%m_add_model(data =m_bio3d(pdb_data)) %>%m_add_surface(style =m_style_surface(opacity =0.4)) %>%m_set_style(style =m_style_cartoon()) %>%m_set_style(sel =m_sel(chain ="B"),style =m_style_cartoon(color ="#FFD105")) %>%m_set_style(sel =m_sel(chain ="C"),style =m_style_cartoon(color ="#FC027E")) %>%m_zoom_to() %>%m_rotate(angle =90, axis ="y") %>%m_spin()}saveRDS(pdb_models,glue("{wkdir}/data/interim/tgfb_analyses/{Sys.Date()}_r3dmol-complexes.rds") )``````{r}pdb_models <-readRDS(glue("{wkdir}/data/interim/tgfb_analyses/2023-03-02_r3dmol-complexes.rds") )```::: panel-tabset#### [TGF-β3 - TGF-βR1]```{r}pdb_models$TGFB3_TGFBR1```#### [TGF-β3 - TGF-βR2]```{r}pdb_models$TGFB3_TGFBR2```#### [TGF-β3 - TGF-βR1 - TGF-βR2] ```{r}pdb_models$TGFB3_TGFBR1_TGFBR2```#### [HP-TGM - TGF-βR1]```{r}pdb_models$`HP-TGM_TGFBR1````#### [HP-TGM - TGF-βR2]```{r}pdb_models$`HP-TGM_TGFBR2````#### [HP-TGM - TGF-βR1 - TGF-βR2]```{r}pdb_models$`HP-TGM_TGFBR1_TGFBR2````:::### Complex Binding EnergiesIn-progress### Binding Interface AnalysisIn-progress## Methods**Signal Peptide Removal**SignalP 6.0 (fast) was used to predict and trim signal peptides. This prediction prediction model is based on a Bert protein language model encoder and a conditional random field (CRF) decoder.**Sequence Alignment**Needleman-Wunsch and Smith-Waterman algorithms were applied using [emboss](https://anaconda.org/bioconda/emboss) v6.6.0.**profile Hidden Markov Model Analysis**profile Hidden Markov Models were created using jackhmmer from HMMER v.3.3, with a maximum of 5 iterations and an E-value threshold of 0.05 against the latest UniProt catalog (as of 2023-03-02). Hidden models were assembled via hmmbuild and hmmsearch was applied against our custom microbial catalog database.**Protein Complex Structure Prediction**Alphafold v2.2.0 was applied using the "multimer" setting. **R-Packages**```{r}sessionInfo()```